ABSTRACT

Who are the influential people in an online social network? 
The answer to this question depends not only on the struc- 
ture of the network, but also on details of the dynamic pro- 
cesses occurring on it. We classify these processes as conser- 
vative and non-conservative. A random walk on a network 
is an example of a conservative dynamic process, while in- 
formation spread is non-conservative. The influence models 
used to rank network nodes can be similarly classified,
depending on the dynamic process they implicitly emulate.
We claim that in order to correctly rank network nodes, the 
influence model has to match the details of the dynamic 
process. We study a real-world network on the social news 
aggregator Digg, which allows users to post and vote for 
news stories. We empirically define influence as the num- 
ber of in-network votes a user's post generates. This in- 
fluence measure, and the resulting ranking, arises entirely 
from the dynamics of voting on Digg, which represents non- 
conservative information flow. We then compare predictions 
of different influence models with this empirical estimate of 
influence. The results show that non-conservative models 
are better able to predict influential users on Digg. We find
that the (normalized) α-centrality metric is one of the best
predictors of influence. We also present a simple algorithm
for computing this metric, together with the associated
mathematical formulation and analytical proofs.
1. INTRODUCTION

Online social networks have become important hubs of so- 
cial activity and conduits of information. Popular social net- 
working sites such as Facebook, the social news aggregator 
Digg, and the microblogging service Twitter have undergone 
explosive growth. Though some research suggests that people
are more affected by the opinions of their peers than by
influentials [15, 16], recent studies of online social networks [9]
support the hypothesis that influentials exert a disproportionate
amount of influence. With the number of active users on these
sites in the millions or even tens of millions, identifying
influential users among them becomes an important problem with
applications in marketing [27], information dissemination
[20, 32], search [1], and expertise discovery.



While many influence models and centrality measures have 
been proposed to rank actors within a social network, almost 
all of them make implicit assumptions about the underlying 
dynamic process occurring on the network [5], which may
not be applicable to online social networks. Since dynamic
processes can often be directly observed on online social
networks [31, 30], these networks provide us with a unique
opportunity to study influence. In this paper we address the
question of which influence model is most suitable to predict 
the influence standings of users within online social networks 
whose main function is to disseminate information. 

We classify dynamic processes, or flows, that can occur on 
social networks as conservative or non-conservative. We define
a flow to be conservative ("transfer" in Borgatti's [5]
terminology) if the initial mass or content of the network is
equal to the final mass after the flow has taken place. For
example, suppose that each user $u_i$ in a social network has
some amount of money $m_{u_i}$, some fraction of which she can
transfer to any of her friends. The total amount of money
in the network remains constant ($\sum_i m_{u_i} = c$) at all times.
Hence, the money exchange process within a social network
is conservative. We define a flow to be non-conservative
("parallel" or "serial" duplication in Borgatti's terminology) 
if the initial mass of the system is not equal to the final 
mass after the flow has taken place. Information flow is a 
non-conservative process. When a user posts a new item 
on Facebook, Digg or Twitter, she broadcasts this informa- 
tion to her social network. Each of her social links may in 
turn broadcast the information to their own social networks, 
thereby continuing the parallel duplication process. Note 
that this process is somewhat different from gossip, which is 
one-to-one, or serial duplication of information. Both types 
of information spread in online social networks, however, are 
non-conservative processes. 
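
The distinction can be illustrated with a minimal sketch, which is not taken from the paper. It assumes a toy three-node network and two hypothetical update rules: `conservative_step` splits a node's mass equally among its out-links, while `broadcast_step` hands a full copy to every out-link. The function names and the equal-split rule are illustrative assumptions.

```python
def conservative_step(mass, out_links):
    """Transfer: each node splits its mass equally among its out-links.

    A node without out-links keeps its mass, so the total never changes.
    """
    new_mass = {node: 0.0 for node in mass}
    for node, m in mass.items():
        targets = out_links.get(node, [])
        if not targets:
            new_mass[node] += m
            continue
        share = m / len(targets)
        for target in targets:
            new_mass[target] += share
    return new_mass


def broadcast_step(mass, out_links):
    """Parallel duplication: the sender keeps its item and every
    out-link target receives a full copy, so the total can grow."""
    new_mass = dict(mass)
    for node, m in mass.items():
        for target in out_links.get(node, []):
            new_mass[target] += m
    return new_mass


# Toy network: a sends to b and c, b sends to c.
links = {"a": ["b", "c"], "b": ["c"], "c": []}
start = {"a": 1.0, "b": 0.0, "c": 0.0}

after_transfer = conservative_step(start, links)
after_broadcast = broadcast_step(start, links)
print(sum(after_transfer.values()), sum(after_broadcast.values()))
```

In this toy run the transferred mass still sums to the initial value, whereas the broadcast mass does not, which is the property that separates the two classes of flow.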

Influence models, too, can be categorized as conservative or
non-conservative. Though these models take only network
structure into account when measuring the importance or
centrality of an actor within the network, they make assumptions
about the details of the underlying dynamic process
taking place on the network. For example, the PageRank
algorithm [8], commonly used for network analysis, models a
conservative diffusion process; therefore, it may not be an 
appropriate ranking metric for online social networks. In 
order to correctly rank influential nodes in a network, the 
influence model has to match the details of the dynamic 
process. 

This paper makes three contributions. In Section 2 we review
many of the existing influence models found in the literature
and categorize them according to whether they model
a conservative or a non-conservative dynamical process. In
Section 3 we study a real-world network on the social news
aggregator Digg. Digg allows users to post and vote for news
stories, and also to create networks in order to track what
new stories their friends posted or voted for. We define an
empirical measure of influence as the number of in-network
votes a user's post generates, i.e., votes that come from that
user's social network links. This influence measure, and the
resulting ranking, arises entirely from the dynamics of voting
on Digg. In Section 4 we evaluate the different influence
models by correlating the rankings they produce with the
rankings produced by the empirical measure of influence.
We find that the non-conservative α-centrality [4] best predicts
the rankings of Digg users. These results corroborate
our claim that the details of the influence model used for
ranking actors in a network should match the details of the
dynamic process occurring on the network. We review related
work pertaining to each section in the section itself.

There exist many empirical studies of social behavior and
influence on online social networks. Some of these studies
compare empirical measures of influence with structural
models of influence such as PageRank or in-degree centrality
[30, 9]. However, there is a need to clearly differentiate
between the two distinct methods of
quantifying influence in online social networks:

1. Measurements of online social behavior or the dynamic 
processes occurring on a social network to estimate 
influence. 

2. Using influence models based on the structural prop- 
erties of the underlying social network to predict influ- 
ence. 

Empirical estimates of influence measured from online social
behavior do not have the predictive capabilities of the structural
models of influence. To the best of our knowledge, ours
is the first work that evaluates predictive influence models,
based solely on structural properties of the underlying social
network, using the actual dynamic process occurring on
a real-world network, unlike existing works, which simulate
the underlying dynamic process [5, 28].

We also provide a simple method to calculate the (normalized)
α-centrality, together with a mathematical validation and
analytical proofs for this method.

2. INFLUENCE MODELS 

A network of n actors and m links can be represented as a
graph $G(V,E)$ with $|V| = n$ nodes and $|E| = m$ edges.
Each actor is represented as a node. An edge exists between
node i and node j if actor i is linked to actor j. The edges might be
weighted to exhibit the strength of the links. The geodesic
distance from i to j is $gd(i,j)$. Let $d_i^{in}$ be the in-degree of
node i and $d_i^{out}$ be the out-degree of node i. If there exists
a directed edge $e_{ij}$ from i to j, we say that i is a fan of j and
j is a friend of i. Let $A = (A_{ij})$ be the adjacency matrix
of the corresponding network, whose maximum eigenvalue
is $\lambda_1$ and whose maximum out- and in-degrees are $d_{max}^{out}$ and
$d_{max}^{in}$, respectively.


Below we review existing influence models found in litera- 
ture and categorize them as conservative or non-conservative 
according to the flow that they model. 

2.1 Geodesic Path-based Ranking Measures 

All geodesic path-based ranking methods assume that network
flow is conservative in nature. Moreover, these methods
assume a binary flow. We define a binary dynamical
process, D, as follows:

• There exists some initial mass at some node i ($m_i = M_D$)
at time $t_0$.

• At any time, the node j currently holding the mass may either
transfer the entire mass to a single neighbor k ($m_k = M_D$)
or keep it to herself ($m_j = M_D$).

• The length of the path traversed while moving from node
i to node j via edge $e_{ij}$ ($\forall e_{ij} \neq 0$) is equal to the weight
of the edge from node i to node j.



Closeness Centrality. Let node i generate a sequence of
binary processes ($D_{ij_1}, D_{ij_2}, \ldots, D_{ij_p}, \ldots$) with the objective
of process $D_{ij_p}$ being to transfer the initial mass $M_{D_{ij_p}}$
at node i to another node $j_p$ ($j_p \neq i$). For binary flow process
$D_{ij_1}$, with the initial mass at node i ($m_i = M_{D_{ij_1}}$, $m_k = 0\ \forall k \neq i$),
the geodesic distance $gd(i,j_1)$ is the shortest distance
traversed by the mass in reaching destination node $j_1$.
When this mass reaches $j_1$, node i generates another flow,
$D_{ij_2}$ ($m_i = M_{D_{ij_2}}$, $m_{j_1} = M_{D_{ij_1}}$, $m_k = 0\ \forall k \neq i, j_1$), with
the objective that the initial mass $M_{D_{ij_2}}$ be transferred to
another node $j_2$ ($j_2 \neq j_1, i$), which does not have any mass.
Continuing with this sequence of binary flows, when mass
$M_{D_{ij_{p-1}}}$ reaches node $j_{p-1}$, node i generates another flow,
$D_{ij_p}$ ($m_i = M_{D_{ij_p}}$, $m_{j_k} = M_{D_{ij_k}}\ \forall j_k \neq i, k < p$,
$m_{j_k} = 0\ \forall j_k \neq i, k \geq p$), with the objective that the initial mass
$M_{D_{ij_p}}$ be transferred to another node $j_p$ ($j_p \neq i, j_k\ \forall k < p$),
which does not have any mass. This process is terminated
when every node connected to node i has some mass. Closeness
centrality of node i is inversely proportional to the
shortest total distance traversed by all masses transferred
from node i to some connected node j, when this sequence
of binary processes terminates.

There are different definitions of closeness centrality in the
literature. Hakimi [22] and Sabidussi [35] defined closeness
centrality as

$$C_c(i) = \frac{1}{\sum_{j \neq i} gd(i,j)} \qquad (1)$$

In order to discount network size, Wasserman and Faust [37]
modified the definition of closeness centrality to

$$C_c(i) = \frac{n-1}{\sum_{j \neq i} gd(i,j)} \qquad (2)$$

These closeness centrality measures implicitly assume that
the underlying social network is strongly connected. However,
this assumption does not hold for most real-life networks.
Therefore, Lin [34] redefined closeness centrality using the
number of nodes reachable from node i, $J_i$.
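
As an illustration only, the size-discounted closeness of Equation (2) can be sketched for an unweighted graph using breadth-first search to obtain the geodesic distances. The adjacency-list representation and the helper names `geodesic_distances` and `closeness` are assumptions of this sketch. Returning `None` when some node is unreachable is also a choice made here, reflecting the strong-connectivity assumption noted above.

```python
from collections import deque


def geodesic_distances(adj, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, []):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def closeness(adj, i):
    """Equation (2): (n - 1) / sum_j gd(i, j).

    Returns None when some node cannot be reached from i, because the
    measure assumes a strongly connected network.
    """
    nodes = set(adj) | {w for targets in adj.values() for w in targets}
    dist = geodesic_distances(adj, i)
    if len(dist) < len(nodes):
        return None
    total = sum(d for node, d in dist.items() if node != i)
    return (len(nodes) - 1) / total if total > 0 else None


# Undirected path a - b - c, stored as a symmetric adjacency list.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(closeness(path, "a"), closeness(path, "b"))
```

On this path the middle node obtains the larger score, since its total distance to the other nodes is smallest.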
Graph Centrality. Let node i generate a binary flow $D_i$
with the objective of transferring mass $M_{D_i}$ to a node j at the
largest geodesic distance from it. Graph centrality is inversely
proportional to the distance traversed by mass $M_{D_i}$
in moving from node i to node j. Formally, graph centrality
is defined as [21]:

$$C_g(i) = \frac{1}{\max_{j \in V \setminus \{i\}} gd(i,j)} \qquad (4)$$

Betweenness Centrality. Let every node j generate a sequence
of binary flows ($D_{jk}$) with the objective of transferring
mass $M_{D_{jk}}$ to node k ($k \in V \setminus \{j\}$). Let each of these
processes take the shortest route, i.e., the path of transfer
of mass $M_{D_{jk}}$ from j to k is the geodesic path $gd(j,k)$.
Betweenness centrality of node i is proportional to the total
number of times node i is traversed by all these processes
(excluding processes that start or end at i). Formally,
betweenness centrality is defined as [18]:

$$C_b(i) = \sum_{j \neq i \neq k} \frac{\sigma_{jk}(i)}{\sigma_{jk}} \qquad (5)$$

where $\sigma_{jk}$ is the number of geodesic paths from j to k and
$\sigma_{jk}(i)$ is the number of shortest paths from j to k which
traverse i.

2.2 Topological Ranking Measures

The topology of a network is characterized by the inherent
structural properties of the nodes and edges comprising it.
In the case of a node, it includes the node's in- and out-degree.
In the case of an edge $e_{ij}$, it includes the out-degree of node i
and the in-degree of node j. The geodesic path-based ranking
measures do not take the topology of the network into account.
This is due to the binary nature of the underlying
process, i.e., either the entire mass is transferred from node
i to a single neighbor j or it is not transferred at all. This
transfer is independent of the number of neighbors of i and
j. However, there do exist topological ranking measures, as
described below.

2.2.1 Markov Process-based Ranking Measures

Markov processes describe a broad class of random processes,
including random walks and diffusion processes. In
a Markov process, the probability of transfer from node i to
j depends only on the state of i, which is described by the
topology of node i.

Let $P = (P_{ij})$ define the transition matrix of a Markov process.
Let $M_{D_i}$ be the number of Markov processes generated
at node i. Let $m_j^{t_k}$ be the mass in node j from process $D_i$
(generated by node i) at time $t_k$. Consider the following
conservative flow $D_i$:

• The initial mass is at node i, $m_i^{t_0} = M_{D_i}$ at time $t_0$.

• At time $t_1$, i transfers a fraction $P_{ij}$ of mass $M_{D_i}$ to
her friend j ($m_j^{t_1} = P_{ij} M_{D_i}$). She may also retain
some fraction $P_{ii}$ of the mass, so that $\sum_{k \in \{i\} \cup friend(i)} P_{ik} = 1$.

• Similarly, at any time $t_k$, node j transfers some fraction
$P_{jp}$ of its mass to her friend p ($p \in friend(j)$) and keeps
some fraction of the mass to herself.

This flow conserves the total mass in the system at any
given time, $\sum_{j \in V} m_j^{t_k} = M_{D_i}$. The expected number of
Markov processes generated at i reaching node j at time $t_k$
is equal to $m_j^{t_k}$. Hence the flow underlying a Markov process
is conservative.



PageRank. Let

$$P_{ij} = \begin{cases} 1/d_i^{out} & \text{if } e_{ij} \text{ exists} \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

Let each node i generate a flow $D_i$. Each flow terminates at
time $t_k$ such that $m_i^{t_k} = m_i^{t_{k-1}}$, $\forall i \in V$. Let $m_j$
be the total mass at node j (summed over all flows $D_i$) when all
these processes terminate, where $\sum_{j} m_j = \sum_{i \in V} M_{D_i}$.
$\alpha$ is called the damping factor. PageRank ranks node i
proportional to the mass $m_i$ at node i at the end of these
processes [8].

$$C_{pr,\alpha}(i) = (1-\alpha) + \alpha \sum_{j \in fans(i)} \frac{C_{pr,\alpha}(j)}{d_j^{out}}$$

$$C_{pr,\alpha} = (1-\alpha)e + \alpha C_{pr,\alpha} P \qquad (7)$$

where $C_{pr,\alpha}$ is the (1 × n) PageRank vector, e is the (1 × n)
unit vector and P is the (n × n) transition matrix. Using a
personalization vector v [11] instead of e, we have personalized
PageRank as

$$C_{pr,\alpha} = (1-\alpha)v + \alpha C_{pr,\alpha} P \qquad (8)$$
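
The recurrence of Equation (8) can be sketched as a plain fixed-point iteration. This is a minimal illustration under stated assumptions, not the implementation used in the paper. The adjacency matrix is a dense list of lists, each row of P is the corresponding row of A divided by the out-degree, and a node with no out-links simply has a zero row, so its mass is not redistributed. The function name `pagerank` and the stopping rule are choices made here.

```python
def pagerank(adj, alpha=0.85, v=None, tol=1e-12, max_iter=1000):
    """Iterate C <- (1 - alpha) * v + alpha * C * P in row-vector form.

    adj[i][j] is nonzero when there is an edge from i to j.
    """
    n = len(adj)
    v = [1.0] * n if v is None else list(v)
    out_deg = [sum(row) for row in adj]
    c = list(v)
    for _ in range(max_iter):
        nxt = [(1.0 - alpha) * v[j] for j in range(n)]
        for i in range(n):
            if out_deg[i] == 0:
                continue  # assumption: a dangling node has a zero row in P
            share = alpha * c[i] / out_deg[i]
            for j in range(n):
                if adj[i][j]:
                    nxt[j] += share * adj[i][j]
        if max(abs(a - b) for a, b in zip(nxt, c)) < tol:
            return nxt
        c = nxt
    return c


# Nodes 1 and 2 both point to node 0.
star = [[0, 0, 0],
        [1, 0, 0],
        [1, 0, 0]]
scores = pagerank(star)
print(scores)
```

With the unit vector as personalization vector, the two nodes without fans keep only the (1 - α) term, while node 0 additionally collects a damped share from each of its fans.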



Hubbell's Model. A very similar model was proposed by
Hubbell in 1965 [24]:

$$C_H^T = v^T + P\,C_H^T \qquad (9)$$

where $C_H$ is Hubbell's ranking vector.



2.2.2 Degree Centrality 

Consider a non-conservative flow $D_i$ as follows:

• The process consists of a single time interval $[t_0, t_1]$.

• If the process is generated at node i, who has mass
$m_i = M_{D_i}$ at time $t_0$, this mass is duplicated to all
her fans at time $t_1$. Therefore the total mass of the
system at time $t_1$ is $d_i^{in} M_{D_i}$, which is proportional to
the in-degree centrality of i.

Let each node $i \in V$ start a flow. The in-degree centrality
of node i is $C_{d^{in}}(i) = d_i^{in}$.

Similarly, consider another non-conservative flow $D_j$ as
follows:

• The process consists of a single time interval $[t_0, t_1]$.

• If the process is generated at node j, who has mass
$m_j = M_{D_j}$ at time $t_0$, this mass is duplicated to all
her friends at time $t_1$. Therefore the total mass of the
system at time $t_1$ is $d_j^{out} M_{D_j}$, which is proportional to
the out-degree centrality of j.

Let each node $j \in V$ start a flow. The out-degree centrality
of node j is $C_{d^{out}}(j) = d_j^{out}$.
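
For concreteness, both degree centralities can be read off an edge list in a few lines. The sketch below assumes the convention of Section 2, where a directed edge (i, j) means that i is a fan of j. The user names and the helper `degree_centralities` are made up for illustration.

```python
from collections import Counter


def degree_centralities(edges):
    """Return (in_degree, out_degree) counters for a directed edge list.

    An edge (i, j) means i is a fan of j, so the in-degree of j is the
    number of fans of j and the out-degree of i is the number of friends of i.
    """
    in_deg = Counter(j for _, j in edges)
    out_deg = Counter(i for i, _ in edges)
    return in_deg, out_deg


# Hypothetical users: ann and bob are fans of cat, and cat is a fan of ann.
edges = [("ann", "cat"), ("bob", "cat"), ("cat", "ann")]
in_deg, out_deg = degree_centralities(edges)
print(in_deg["cat"], out_deg["cat"])
```

A `Counter` returns zero for users who never appear as a target or a source, so users without fans or without friends need no special handling.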

2.2.3 Path-based Ranking Measures 

Consider a non-conservative flow $D_i$ defined as follows:

• $D_i$ is generated by node i at time $t_0$, $m_i^{t_0} = M_{D_i}$.

• At time $t_1$, let $\alpha_{ij}^{1}$ be the fraction of mass at i duplicated
by j, where j is a friend of i. Therefore the
mass at j at time $t_1$ due to process $D_i$ is $m_j^{t_1}$, where
$m_j^{t_1} = \alpha_{ij}^{1} m_i^{t_0}$.

• Similarly, let $\alpha_{qj}^{k}$ be the fraction of mass at q duplicated
by friend j at time $t_k$. Therefore the mass at j at time
$t_k$ is $\sum_{p=0}^{k} m_j^{t_p}$, where $m_j^{t_p} = \sum_{q \in fans(j)} \alpha_{qj}^{p} m_q^{t_{p-1}}$,
$p > 0$.

α-Centrality. Let each node i generate a flow $D_i$ as described
above. If these non-conservative flows persist for a long time
with $M_{D_i} = v_i$ and $\alpha_{qj}^{p} = \alpha\ \forall p$, then the mass at i at time
$t_k$ is proportional to the rankings given by α-centrality.

Formally, α-centrality as defined by Bonacich [4] is:

$$C_{alpha,\alpha} = v + \alpha C_{alpha,\alpha} A = v(I - \alpha A)^{-1} \qquad (10)$$

where $C_{alpha,\alpha}$ is the α-centrality vector. α-centrality can
also be written as:

$$C_{alpha,\alpha} = v\,C_{\alpha,k \to \infty}$$

where $C_{\alpha,k} = \sum_{t=0}^{k} \alpha^t A^t$ is the α-centrality matrix. Hence,
for Equation 10 to hold, the series $\{C_{\alpha,k}\}$ must necessarily
converge as k approaches infinity. This happens if and only if
$|\alpha| < 1/|\lambda_1|$. Therefore α-centrality can only be calculated for
$|\alpha| < 1/|\lambda_1|$. Bonacich states that α "reflects the relative
importance of endogenous versus exogenous factors in the
determination of centrality." Following Katz [26], we call α
the attenuation factor.

Normalized α-centrality. In this paper we define normalized
α-centrality as:

$$C_{N_\alpha,\alpha} = \frac{v\,C_{\alpha,k \to \infty}}{\sum_{i,j} (C_{\alpha,k \to \infty})_{ij}} \qquad (13)$$

As stated above, the value of α is subject to the constraint
$|\alpha| < 1/|\lambda_1|$. We show that computation of normalized
α-centrality is not bounded by this constraint (Section 4.1).
However, rankings given by normalized α-centrality are equal
to the rankings given by α-centrality for $|\alpha| < 1/|\lambda_1|$. We also
show that the value of normalized α-centrality remains the same
$\forall \alpha \in (1/|\lambda_1|, 1]$ ($C_{N_\alpha,\alpha > 1/|\lambda_1|} = C_{N_\alpha}$) and is independent of α.
We further show that when $|\lambda_1|$ is strictly greater than any other
eigenvalue, $\lim_{\alpha \to 1/|\lambda_1|} C_{N_\alpha,\alpha}$ exists and
$\lim_{\alpha \to 1/|\lambda_1|} C_{N_\alpha,\alpha} = C_{N_\alpha} = C_{N_\alpha,\alpha > 1/|\lambda_1|}$.
The details of the mathematical formulation of this metric and the
associated proofs are given in the Appendix.

Katz Score. If $v = \alpha e A$, α-centrality reduces to the Katz score
[26]:

$$C_{k,\alpha} = \alpha e A (I - \alpha A)^{-1} \qquad (14)$$

SenderRank. SenderRank [28] has a similar flavor, being
defined as:

$$C_{SR,\alpha} = (1-\alpha)(I - \alpha A)^{-1} e^T \qquad (15)$$

where $C_{SR,\alpha}$ is the SenderRank vector.

Eigenvector Centrality. Eigenvector centrality [3] is given
by:

$$C_E(i) = \frac{1}{\lambda_1} \sum_{j=1}^{n} A_{ji}\,C_E(j), \qquad C_E = \frac{1}{\lambda_1} C_E A \qquad (16)$$

where $C_E$ is the eigenvector centrality vector, since it is equal
to an eigenvector of A (corresponding to $\lambda_1$). Most real-life
networks, such as online social networks, have asymmetric
relations. Bonacich [4] showed that the eigenvector centrality
approach does not work well for asymmetric relations.
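
To make the α-centrality recurrence of Equation (10) concrete, the following sketch iterates C ← v + αCA on a small dense matrix until the vector stops changing. It is an illustration under stated assumptions rather than the authors' procedure. The function name `alpha_centrality`, the list-of-lists matrix, and the stopping rule are choices made here, and the iteration is only expected to converge when |α| < 1/|λ_1|.

```python
def alpha_centrality(adj, alpha, v=None, tol=1e-12, max_iter=10000):
    """Iterate C <- v + alpha * C * A (row-vector convention).

    After k steps C equals v * sum_{t=0..k} alpha^t A^t, i.e. v times the
    truncated alpha-centrality matrix.
    """
    n = len(adj)
    v = [1.0] * n if v is None else list(v)
    c = list(v)
    for _ in range(max_iter):
        nxt = [v[j] + alpha * sum(c[i] * adj[i][j] for i in range(n))
               for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, c)) < tol:
            return nxt
        c = nxt
    raise ValueError("no convergence; alpha is probably not below 1/|lambda_1|")


# Directed chain 0 -> 1 -> 2 with v = e and alpha = 0.5.
chain = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
print(alpha_centrality(chain, 0.5))
```

On the chain, each node's score adds an attenuated contribution for every path that ends at it, so the last node, which is reached by paths of length one and two, ranks highest.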



3. EMPIRICAL ESTIMATE OF INFLUENCE 

We study a real-world online social network on Digg. Digg 
is a social news aggregator that enables users to collectively 
moderate stories they find online by submitting them to 
Digg and voting for them. Digg promotes the best stories 
(i.e., stories that receive many votes) to its front page. In ad- 
dition, Digg allows users to create social networks by adding 
as friends, users whose activities they want to track. Using 
the Friends Interface, a user can see the stories her friends 
recently submitted or voted for. 

We rank Digg users according to the centrality measures de- 
fined above and compare these rankings with those produced 
by the empirical estimate of influence. Our objective is to 
find the influence model that best predicts influential users
in this network. 

3.1 Data Collection 

We used the Digg API to collect data about 3,553 stories
promoted to the front page in June 2009. The data associated
with each story contained the story title, story id, link, submitter's
name, submission time, list of voters and the time of
each vote, and the time the story was promoted to the front page.
In addition, we collected the list of voters' friends. From this 
information, we were able to reconstruct the network of Digg 
users who were active during the sample period. Borrowing 



the concept of active user from media research [33], we define
an active user as a person who votes on at least one story.
Next, we get the connections between the active users. We
say user a is connected to user b if a is either a friend or a fan
of user b. We include an active user in the active users network if
he is connected to one or more active users. In our dataset,
there are 139,410 distinct voters who have voted on at least
one story. Out of these users, 69,524 voters are connected to
one or more active users and hence are members of the active
users network. These 69,524 connected users form the underlying
friendship network. Of these, 57,908 users form one
giant connected component. The diameter of the network
(length of the longest shortest path) is 16. Thus we observe
empirically that the network exhibits the small-world phenomenon
[36, 14], since the diameter of the network is $O(\log n)$
(n = 69,524). 572 of the 587 distinct submitters belong to
this friendship network.

3.2 Estimation of Influence 

When a user (submitter) posts a story on Digg her fans are 
able to see the story. Some of these fans will like the story 
and vote for it. The story will then become visible to their 
own fans, who may themselves choose to vote for the story, 
and so on. Therefore, the underlying dynamic process of 
information spread on the network is non-conservative in 
nature. 

Assume that user i posts a story. If a link $e_{ji}$ exists from
j to i, then user j is a fan of user i and is watching her 
activities. In other words, when i posts a story, j is able to 
see it through Digg's Friends Interface. If j also votes for 
the story, we call her vote a fan vote. The probability that 
a submitter's fan votes on a story depends on 

1. the influence of the submitter 

2. the quality of the story 

We assume that story quality is a random variable, uncor- 
related with the submitter. Therefore, we can average out 
the contribution of story quality to submitter's influence, by 
aggregating fan votes over all stories submitted by the same 
user. 

Let N be the total number of users in the network (N =
69,524) and K be the number of fans of the submitter i of story
$s_i$. Let k be the total number of fan votes that story $s_i$
receives within the first n votes. We set n = 100, calculating
the number of fan votes within the first 100 votes. The
stochastic process of voting is described by the urn model,
in which n balls are drawn without replacement from an
urn containing N balls in total, of which only K balls are
white. The probability that k of the first n votes are from
the submitter's fans purely by chance is equivalent to the
probability that k of the n balls drawn from the urn are white.
Hence, the probability that X = k of the first n votes are
from the submitter's fans, $P(X = k \mid K, N, n)$, is given by the
hypergeometric distribution:

$$P(X = k \mid K, N, n) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}} \qquad (17)$$
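
The chance probability given by the hypergeometric distribution can be evaluated exactly with integer binomial coefficients. The sketch below is illustrative. The function name `fan_vote_probability` is an assumption, and the value K = 50 in the usage line is a made-up in-degree rather than a figure from the data.

```python
from math import comb


def fan_vote_probability(k, K, N, n):
    """Hypergeometric probability that exactly k of the first n votes
    come from the submitter's K fans, out of N users, purely by chance."""
    if k < 0 or k > min(K, n) or n - k > N - K:
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


# N and n as in the text; K = 50 fans is a hypothetical in-degree.
N, n, K = 69524, 100, 50
p_at_least_five = sum(fan_vote_probability(k, K, N, n) for k in range(5, K + 1))
print(p_at_least_five)
```

Summing the tail, as in the usage line, gives the probability of observing at least a given number of fan votes by chance alone.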
Figure 1: (a) The scatter plot shows the average
number of fan votes received by a story within the
first 100 votes vs. the submitter's in-degree (number of
fans). Each point represents a distinct submitter.
The line gives the expected number of fan votes
given the in-degree, which can be approximated
($r^2 = 0.75$) by a Weibull cumulative distribution. (b)
This plot shows the probability of the expected number
of fan votes being generated purely by chance.
The inset zooms in for in-degree less than 100.



Of 3552 stories in our data set, 3489 were submitted by 572
users within the social network. Of these, 504 submitters
received at least one fan vote in the first 100 votes for one
of their stories. These 504 users submitted a total of 3396
stories. For each of these 504 distinct submitters, we plot
the average number of fan votes received by their stories vs.
the submitter's in-degree in Figure 1(a). This scatter plot is
approximated by the Weibull cumulative distribution ($r^2 = 0.75$),
(fc) = 65(1 - e-to-ooii^+o-ooos)"'"). We use this expression 
to estimate the expected number of fan votes $\langle k \rangle$ within
the first 100 votes for a user with in-degree K (N = 69,524,
n = 100). Using the hypergeometric distribution above (Equation 17),
we then calculate the probability that $\langle k \rangle$
fans voted purely by chance. As can be clearly seen in
Figure 1(b), for K > 10 the probability that $\langle k \rangle$
of the submitter's K fans voted purely by chance is exceedingly
small (P < 0.00038). We conclude
that averaging fan votes over all stories submitted by
a given user is an effective indicator of her influence (given
she has at least 10 fans).

In our dataset, in order to mask the effect of story qual- 
ity, we consider only those users who submitted at least two 



stories. There were 289 distinct submitters with more than 
two stories which received at least one fan vote within the 
first 100 votes. All these submitters had more than 10 fans. 
We use the average number of fan votes that stories sub- 
mitted by these users received (within the first 100 votes) 
as an indicator of their influence. We then rank submitters 
according to this empirical measure of influence. 

Influence can be similarly measured in other social networks. 
For instance, in [9, 30], influence in Twitter is measured
for the dynamic process of information propagation using 
retweets and mentions. However, these studies have not 
proven the statistical significance of the measures employed. 

4. COMPARISON OF INFLUENCE MODELS

In most situations, data detailing the history of a dynamical 
process on a network is not available; therefore, calculating 
an empirical estimate of influence is not feasible. Instead, 
many approaches were developed to identify important or 
influential actors solely using the structure of the network. 
Some attempts have been made to analyze these approaches
by simulating the underlying dynamical processes [5, 28].
We, on the other hand, evaluate these influence models by
comparing them with the empirical measure of influence obtained
from the analysis of the actual dynamic process of
information diffusion on an online social network. Since
information propagation is a non-conservative process, we
hypothesize that non-conservative models will best predict the
influentials within the network.



4.1 Calculating Influence 

It can be easily shown that α-centrality is a generalization of
the Katz score (Equation 14). The computation of α-centrality
is restricted to $\alpha < 1/|\lambda_1|$ [4]. However, computing an eigenvalue
$\lambda_1$ of the network adjacency matrix A is a costly and
time-consuming process. Besides, the value of $1/|\lambda_1|$ turns
out to be very small in most large networks. In this paper,
we present a simple algorithm to calculate the normalized
α-centrality, which is not bounded by this tight constraint.
However, for a given value of $\alpha < 1/|\lambda_1|$, the rankings given
by normalized α-centrality are equal to the rankings given
by α-centrality (Appendix, Theorem 1). As α is increased
($\alpha > 1/|\lambda_1|$), $C_{N_\alpha,\alpha}$ converges to $C_{N_\alpha}$, which is independent
of α (Appendix, Theorem 2). A simple algorithm for computing
normalized α-centrality as α is varied, using dynamic
programming, is given below.

Algorithm 1 does not depend on the value of $\lambda_1$. Since
normalized α-centrality varies with α only for $\alpha < 1/|\lambda_1|$, in
order to study this variation, we may choose a step size of
$s = c / \min(d_{max}^{out}, d_{max}^{in})$, where c < 1 is a constant. This is
because, using the Gershgorin circle theorem, we know that

$|\lambda_1| \leq \min(d_{max}^{out}, d_{max}^{in})$.

Algorithm 1 Normalized α-centrality

Input
A: Adjacency matrix
v: Personalization vector
s: Step size (can be modified depending on the granularity
of the results desired)
k: Maximum number of iterations in each step (Equation 13)
ε: Tolerance
Output
$\{C_{N_\alpha,\alpha_t} : \alpha_t \in [0,1]\}$
Initialize
t, i ← 0
$\alpha_1$ ← s
repeat
    repeat
        $C^{i+1}_{N_\alpha,\alpha_{t+1}}$ ← $v + \alpha_{t+1} C^{i}_{N_\alpha,\alpha_{t+1}} A$
        i ← i + 1
    until $|C^{i}_{N_\alpha,\alpha_{t+1}} - C^{i-1}_{N_\alpha,\alpha_{t+1}}| < \epsilon$ or i > k
    t ← t + 1
    $\alpha_{t+1}$ ← $\alpha_t$ + s
    i ← 0
until $C_{N_\alpha,\alpha_t} = C_{N_\alpha,\alpha_{t-1}} = C_{N_\alpha}$ or $\alpha_t > 1$

As can be seen in Algorithm 1, in each iteration, for a given
value of α, $C^{i+1}_{N_\alpha,\alpha}$ depends only on $C^{i}_{N_\alpha,\alpha}$ and A. Considering
that the network comprises n actors and m links between
them, in a naive implementation of Algorithm 1 each iteration
has a runtime complexity of O(m) and space complexity
of O(m + n). Assuming that the main memory is just large
enough to hold both $C^{i}_{N_\alpha,\alpha}$ and $C^{i+1}_{N_\alpha,\alpha}$, the I/O cost for each
iteration is O(m). If main memory is large enough to hold
only $C^{i+1}_{N_\alpha,\alpha}$, and assuming an efficient data structure such as
a sorted linked list is used to store A, the I/O cost is O(m + n).
Since the formulation of normalized α-centrality is very similar
to that of PageRank (Equations 10 and 8), similar block-based
strategies can be used for fast and efficient computation
of both PageRank and normalized α-centrality [23, 25].
Like PageRank, normalized α-centrality can easily be
implemented using the map-reduce paradigm [12], guaranteeing
the scalability of this algorithm and its applicability
to very large datasets. Apart from normalized α-centrality
and PageRank, we calculate the influence scores based on
closeness centrality, graph centrality, betweenness centrality,
in-degree centrality, out-degree centrality and SenderRank.
Analogous to α-centrality, for the other parametric
measures of centrality, namely PageRank and SenderRank,
we investigate the change in ranking as the value of the
parameter changes. Since this friendship network shows the
small-world phenomenon (Section 3.1) and is unweighted, a
fast approximation of betweenness centrality can be done
with O(km) run-time complexity, where $k = \Theta(\frac{\log n}{\epsilon^2})$ for
$\epsilon > 0$ [17]. However, we use the fast algorithm for betweenness
centrality given by Brandes [6] for the calculation of
betweenness centrality. It has O(n + m) space and O(mn)
run-time complexity. Graph centrality and closeness centrality
can be computed along very similar lines.



An investigation into the stability of centrality measures
when networks are sampled was carried out in [10]. Eigenvector
centrality turned out to be the most robust centrality,
followed by in-degree centrality. For symmetric cases, as
$\alpha \to 1/|\lambda_1|$, the eigenvector centrality ranks and α-centrality
ranks are identical when $\lambda_1$ is strictly greater than any other
eigenvalue [3]. In this paper, we prove that for normalized
α-centrality, $\lim_{\alpha \to 1/|\lambda_1|} C_{N_\alpha,\alpha}$ exists and is equal to $C_{N_\alpha}$, which
is independent of α. We also show that for symmetric matrices,
the rankings given by eigenvector centrality $C_E$ are
equivalent to the rankings given by normalized α-centrality
$C_{N_\alpha} = \lim_{\alpha \to 1/|\lambda_1|} C_{N_\alpha,\alpha} = C_{N_\alpha,\alpha > 1/|\lambda_1|}$ (given that $\lambda_1$ is
strictly greater than any other eigenvalue).

4.2 Evaluation of Influence Predictions 

Next, we compare the predicted rankings using the influence
models described in Section 2 with the rankings obtained
from the empirical estimate of influence using Pearson's
correlation coefficient, since ties in rank exist.
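
For reference, Pearson's correlation coefficient between two score or rank vectors can be computed as in the following sketch. The function name `pearson` and the example ranks are illustrative and are not data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least two values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up ranks for five users: a model ranking vs. an empirical ranking with a tie.
model_rank = [1, 2, 3, 4, 5]
empirical_rank = [1, 2.5, 2.5, 5, 4]
print(pearson(model_rank, empirical_rank))
```

Tied users can be given the average of the ranks they span, as in the second list, before the coefficient is computed.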
SenderRank and the attenuation factor in (normalized) α-centrality
are very similar mathematically. Therefore, without
loss of generality, we represent both as α (0 < α < 1),



In the Appendix we prove that normalized α-centrality $C_{N_\alpha,\alpha}$
converges to $C_{N_\alpha}$ $\forall \alpha \in (1/|\lambda_1|, 1]$ ($C_{N_\alpha,\alpha > 1/|\lambda_1|}(i) = C_{N_\alpha}(i)$).
In Fig. 2 (inset) we can clearly observe this. $C_{N_\alpha,\alpha}$ (shown
in blue) converges to $C_{N_\alpha}$ for α > 0.007. The correlation
of $C_{N_\alpha}$ with the empirical estimate of influence is very high
(corr. = 0.893024).

On the other hand, we observe that though the PageRank 
score converges for every value of a, the PageRank score, 
and hence the ranking, is dependent on the value of a. Var- 
ious studies have tested different damping factors, but it is 
generally assumed that the damping factor should be set 
around a = 0,85 p], Boldi et al, [2] claim that in case of 
PageRank, "for real-world graphs values of a close to 1 do 
not give a more meaningful ranking," Except for values a 
close to 1, the influence rankings calculated from normalized 
Q-centrality correlated better with the empirical estimates 
of influence rankings than PageRank rankings. 

Correlation of SenderRank, Csr is 0,321, We also observe 
that in-degree centrality, C^in is better correlated to the em- 
pirical estimate of influence (corr,= 0,753) than out-degree 
centrality, C^jout ( corr, = 0,32), Higher in-degree implies 
greater number of fans, A bigger, connected network of fans, 
fans of fans an so on can be inferred from higher (normal- 
ized) a centrality. Since the spread of information depends 
of number of users, who can see and spread the story a user 
submitted; fans and networks of fans have a greater contri- 
bution to this spread than friends and network of friends. 
Hence (normalized) a centrality and in-degree centrality are 
better models for predicting influence than SenderRank and 
out-degree, when the underlying dynamic process is informa- 
tion propagation. There is also a high correlation between 
in-degree centrality and normalized a-centrality. For Cn^ , 
the correlation is 0,91, Our results are in agreement with a 
comparison study of centralities for biological networks [29] , 
where it was shown that the correlation between eigen- vector 
centrality and in-degree centrality was high. 



Figure 2: (a) Correlation between the rankings pro- 
duced by the empirical measure of influence, which 
computes the number of fan votes w^ithin the first 
100 votes, and rankings computed by different cen- 
trality measures, (b) Correlation between the rank- 
ings produced by the empirical measure of influence, 
which computes the number of fan votes within all 
votes, and rankings computed by different central- 
ity measures. Note that a(0 < a < 1) stands for 
the attenuation factor for normalized a-centrality 
and damping factor for PageRank and SenderRank. 
The inset zooms into the variation in correlation for 
0<a< 0,01 

Figure [2|a) shows correlation of influence rankings of the 
submitters, estimated from the average fan votes within the 
first 100 votes with their influence rankings relative to each 
other, calculated using different centrality measures. As can 
be seen from Equation W\ [To] and |15[ the characterization 
of damping factor (or restart probability) in PageRank and 



Since the non-conservative flow of information on Digg is very
different from the conservative flow underlying geodesic
path-based ranking measures, these measures are not well
correlated with the empirical estimate of influence. The
correlation of closeness centrality, $C_c$ [22], [35], [37], is
0.0564 and of that due to Lin et al. [34] is 0.0555. The
correlation of graph centrality, $C_g$, is 0.0313 and of
betweenness centrality, $C_b$, is 0.1112.
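The correlations reported in this section compare two rankings of the same submitters in which ties can occur. A minimal sketch of one way to compute such a value is given below; assigning tied scores the mean of the positions they occupy is an assumption of this example (the text does not specify the tie-handling rule), as are the function names.

```python
from math import sqrt


def average_ranks(scores):
    """Rank scores in descending order; tied scores share their mean position."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0   # positions are 1-based
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_correlation(empirical_scores, model_scores):
    """Pearson correlation between the two induced (tie-aware) rankings."""
    return pearson(average_ranks(empirical_scores), average_ranks(model_scores))
```

Here `empirical_scores` would hold, for example, the average number of fan votes per submitter and `model_scores` the centrality values of the same submitters in the same order.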

If we estimate the influence rankings of users by taking the
average number of fan votes in all the votes that their stories
receive, the trends (Fig. 2(b)) are very similar to those
observed above (Fig. 2(a)). All centrality measures are better
correlated with the empirical estimate of influence thus
obtained, as can be seen from Fig. 2. The correlation of
$C_{N_\alpha}$ with the empirical estimate of influence is very
high (corr. = 0.928). Again, the correlation of PageRank
changes with α and, except for α very close to 1, is less than
that of (normalized) α-centrality. As in Fig. 2(a), in-degree
centrality, $C_{d_{in}}$, is better correlated (corr. = 0.82)
than out-degree centrality, $C_{d_{out}}$ (corr. = 0.41). The
correlation between $C_{d_{in}}$ and $C_{N_\alpha}$ is 0.92.
The correlation of closeness centrality, $C_c$ [22], [35],
[37], is 0.116 and of that due to Lin et al. [34] is 0.103. The
correlation of graph centrality, $C_g$, is 0.097 and of
betweenness centrality, $C_b$, is 0.1657. The correlation of
SenderRank, $C_{sr}$, is 0.407.

Next, we predict the rankings of all 69,524 users within the
network. We do this using the influence models described
above. Interestingly, the top user predicted by most models
is 'inactive'. 'Inactive' is the nomenclature used by Digg to
denote users who are no longer active, i.e., no longer posting
or voting on new stories. The connection between an 'active'
user u and another user i exists even after i has become
inactive. Thus the 'inactive' user acts as a sink for these
dangling links. We analyze the overall rankings of the top 100
of the 289 submitters whose influence we have estimated
empirically (using average fan votes in the first 100 votes).
Let emp be the set of rankings corresponding to the top 100
of the 289 submitters as determined by the empirical influence.
Let pred be the set of corresponding rankings for the same
submitters using an influence model. The probability that
the top 100 of these 289 submitters ($emp(i) \in [1, 100]$) are
among the top 100 of the 69,524 active users as predicted
by model pred is given by recall,
$R = |emp \cap pred| / |emp|$.
Recall for normalized α-centrality, $C_{N_\alpha}$, is high
(0.76). Using in-degree centrality, $C_{d_{in}}$, for
predictions reduces recall to 0.6. For PageRank, $C_{pr,0.9}$,
and betweenness centrality, $C_b$, recall is 0.29 and 0.21
respectively. Recall is negligible when $C_c$, $C_g$,
$C_{d_{out}}$ and $C_{sr}$ are used for prediction.
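The recall measure $R$ defined above can be sketched as follows. The function name `recall_at_k`, the dictionary of model scores, and the user identifiers are hypothetical and serve only to illustrate the set computation.

```python
def recall_at_k(empirical_top, predicted_scores, k=100):
    """R = |emp ∩ pred| / |emp|.

    empirical_top: users ranked highest by the empirical influence measure.
    predicted_scores: maps every user in the network to a model score.
    pred is taken to be the k highest-scoring users under the model.
    """
    ranked = sorted(predicted_scores, key=predicted_scores.get, reverse=True)
    pred_top = set(ranked[:k])
    emp = set(empirical_top)
    return len(emp & pred_top) / len(emp)
```

In the setting of this section, `empirical_top` would contain the 100 empirically most influential submitters and `predicted_scores` the centrality values of all 69,524 users, with `k=100`.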

The results corroborate our hypothesis: since the
non-conservative dynamic process underlying (normalized)
α-centrality most closely resembles the dynamic process of
information propagation on Digg, (normalized) α-centrality
is a better predictor of the influential users on Digg than
other influence models.



5. CONCLUSION 

In this paper we emphasize the need to distinguish between 
different dynamic processes occurring in complex networks 
based on their distinct characteristics. Specifically, we cate- 
gorize such processes into conservative and non-conservative, 
based on the nature of the flow. Further, we classify struc- 
tural models which predict the influence standings of ac- 
tors within a network into conservative and non-conservative 
models based on the underlying dynamic process that these 
models emulate. We stress that to get the best predictions of
influence within a network using an influence model, the
implicit underlying dynamic process of the model should
correspond closely to the actual dynamic process taking
place in that network.

Online social networks provide us with a unique opportunity
to study the continuously evolving dynamic processes
within these networks. Here, we analyze information flow
on the social news aggregator Digg. We hypothesize that
such a process is non-conservative in nature. Hence, to best
predict the influential people within this network, we need a
non-conservative influence model. The ability to observe the
actual dynamic process occurring on Digg allows us to get
an empirical estimate of influence within it. We show that
this estimate of influence is statistically significant. Using
this empirical influence measure enables us to evaluate the
predictions of different influence models. To the best of our
knowledge, this is the first work which evaluates influence
models based on the structural properties of complex
networks using the actual underlying dynamics of the network.

As hypothesized, non-conservative models perform better than
conservative models of influence. Specifically, we observed
that the non-conservative model of (normalized) α-centrality
is the best predictor of influence within Digg, where the
underlying dynamic process is information propagation. In this
paper, we have also given a simple algorithm for computing
normalized α-centrality and the analytical proofs associated
with it.

Future work would include applying similar analytical tools
to predict influentials on other online social networks. Most
of the existing structural models of influence assume that
the structure of the network remains static in the course
of study. However, online social networks are continually
evolving, and researchers have not yet completely understood
the true nature of this evolution. In future, we would like to
study the evolution of these networks in more depth and apply
the knowledge thus gained to extend the existing prediction
tools so that they take the continual evolution of these
networks into account.

7. APPENDIX

If $\lambda$ is an eigenvalue of $A$, then

$$\left(I - \frac{1}{\lambda}A\right)x = 0 \tag{18}$$

Invertibility of $(I - \frac{1}{\lambda}A)$ would lead to the
trivial solution for the eigenvector $x$ ($x = 0$). Hence, for
the computation of eigenvalues and eigenvectors, we require
that no inverse of $(I - \frac{1}{\lambda}A)$ exists, i.e.

$$\det\left(I - \frac{1}{\lambda}A\right) = 0 \tag{19}$$

Equation 19 is called the characteristic equation; solving it
gives the eigenvalues, and hence the eigenvectors, of the
adjacency matrix $A$.

Using eigenvalues and eigenvectors, the adjacency matrix $A$
can be written as:

$$A = X \Lambda X^{-1} = \sum_{i=1}^{n} \lambda_i Y_i \tag{20}$$

where $X$ is a matrix whose columns are the eigenvectors of
$A$. $\Lambda$ is a diagonal matrix whose diagonal elements are
the eigenvalues, $\Lambda_{ii} = \lambda_i$, arranged according
to the ordering of the eigenvectors in $X$. Without loss of
generality we assume that
$\lambda_1 > \lambda_2 > \cdots > \lambda_n$. The matrices
$Y_i$ can be determined from the product

$$Y_i = X Z_i X^{-1} \tag{21}$$

where $Z_i$ is the selection matrix having zeros everywhere
except for the element $(Z_i)_{ii} = 1$.

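The α-centrality matrix $C_{\alpha,k}$,
$\forall \alpha \in [0, 1]$, is given by:

$$C_{\alpha,k} = I + \alpha A + \alpha^2 A^2 + \cdots + \alpha^k A^k = \sum_{t=0}^{k} \alpha^t A^t \tag{22}$$

The normalized α-centrality matrix is then given by:

$$NC_{\alpha,k} = \frac{1}{\sum_{i,j}^{n} (C_{\alpha,k})_{ij}} C_{\alpha,k} \tag{23}$$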



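As shown in Equations 12 and 13, the α-centrality vector is
$vC_{\alpha,k\to\infty}$ and the normalized α-centrality vector
is $vNC_{\alpha,k\to\infty}$.

$A^k$ can then be written as

$$A^k = X \Lambda^k X^{-1} = \sum_{i=1}^{n} \lambda_i^k Y_i \tag{24}$$

Using Equation 24, Equation 22 reduces to

$$C_{\alpha,k} = \sum_{i=1}^{n} \sum_{t=0}^{k} \alpha^t \lambda_i^t Y_i = \sum_{i=1}^{n} \frac{(-1)^{p_i}\left(1 - \alpha^{k+1}\lambda_i^{k+1}\right)}{(-1)^{p_i}\left(1 - \alpha\lambda_i\right)} Y_i \tag{25}$$

where $p_i = 0$ if $\alpha|\lambda_i| < 1$ and $p_i = 1$ if
$\alpha|\lambda_i| > 1$. As is obvious from the above, for
Equations 22 and 25 to hold non-trivially,
$\alpha \neq \frac{1}{|\lambda_i|}\ \forall i \in 1, 2, \cdots, n$.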
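The agreement between the term-by-term sum of Equation 22 and the spectral form of Equation 25 can be checked on a toy case. In the sketch below the matrix (a single undirected edge), its hand-computed eigenvalues and projectors $Y_i$, and the function names are all assumptions chosen for illustration.

```python
def direct_sum(A, alpha, k):
    """C_{alpha,k} = sum_{t=0}^{k} alpha^t A^t, computed term by term (Equation 22)."""
    n = len(A)
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    C = [row[:] for row in term]
    for _ in range(k):
        # Next term: alpha * (previous term) * A.
        term = [[alpha * sum(term[i][m] * A[m][j] for m in range(n))
                 for j in range(n)] for i in range(n)]
        C = [[c + t for c, t in zip(c_row, t_row)] for c_row, t_row in zip(C, term)]
    return C


def spectral_sum(eigenpairs, alpha, k):
    """The same matrix from Equation 25, given (lambda_i, Y_i) pairs."""
    n = len(eigenpairs[0][1])
    C = [[0.0] * n for _ in range(n)]
    for lam, Y in eigenpairs:
        # Closed form of the geometric sum; requires alpha * lam != 1.
        coeff = (1.0 - (alpha * lam) ** (k + 1)) / (1.0 - alpha * lam)
        for i in range(n):
            for j in range(n):
                C[i][j] += coeff * Y[i][j]
    return C


# Toy symmetric example: one undirected edge between two nodes.
A = [[0.0, 1.0], [1.0, 0.0]]
# Eigenvalues +1 and -1; Y_i = x_i x_i^T for the unit eigenvectors
# (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
EIGENPAIRS = [
    (1.0, [[0.5, 0.5], [0.5, 0.5]]),
    (-1.0, [[0.5, -0.5], [-0.5, 0.5]]),
]
```

For any α with $\alpha|\lambda_i| \neq 1$, `direct_sum(A, alpha, k)` and `spectral_sum(EIGENPAIRS, alpha, k)` should agree up to rounding, including for values such as α = 2 where the series itself diverges as k grows.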



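We consider the characterization of the series
$\{NC_{\alpha,k\to\infty}\}$ for $\alpha \in [0, 1]$.

1. $\alpha \ll \frac{1}{|\lambda_1|}$: If
$\alpha \ll \frac{1}{|\lambda_1|}$, $C_{\alpha,k\to\infty}$ (and
$NC_{\alpha,k\to\infty}$) would be independent of α, since
every factor $\frac{1}{1 - \alpha\lambda_i}$ in Equation 25 is
then close to 1:

$$C_{\alpha,k\to\infty} \approx \sum_{i=1}^{n} Y_i \tag{26}$$

2. $\alpha < \frac{1}{|\lambda_1|}$: The sequence of matrices
$\{C_{\alpha,k}\}$ would converge to $C_\alpha$ as
$k \to \infty$ if all the sequences $\{(C_{\alpha,k})_{ij}\}$
for every fixed $i$ and $j$ converge to $(C_\alpha)_{ij}$ [13].
If $\alpha < \frac{1}{|\lambda_1|}$, $C_{\alpha,k}$ converges to
$C_\alpha$:

$$C_{\alpha,k\to\infty} = C_\alpha, \qquad NC_{\alpha,k\to\infty} = NC_\alpha = \frac{C_\alpha}{\sum_{i,j}^{n} (C_\alpha)_{ij}} \tag{27}$$

Here, by Equation 25,
$C_\alpha = \sum_{i=1}^{n} \frac{1}{1 - \alpha\lambda_i} Y_i$,
since $\alpha^{k+1}\lambda_i^{k+1} \to 0$ for every $i$.

3. $\alpha > \frac{1}{|\lambda_1|}$ and $k \to \infty$: the term
$\alpha^k A^k = \sum_{i=1}^{n} \alpha^k \lambda_i^k Y_i$
dominates in Equation 25, and the common factor $\alpha^k$
cancels on normalization:

$$NC_{\alpha,k\to\infty} = \frac{A^k}{\sum_{i,j}^{n} (A^k)_{ij}} \tag{28}$$

Theorem 1. The induced ordering of nodes due to normalized
α-centrality would be equal to the induced ordering of nodes
due to α-centrality for $\alpha < \frac{1}{|\lambda_1|}$.

Proof. Since $C_\alpha = vC_{\alpha,k\to\infty}$ and
$C_{N_\alpha,\alpha} = vNC_{\alpha,k\to\infty}$, from Equations
26 and 27 the induced ordering of nodes due to α-centrality
($\alpha < \frac{1}{|\lambda_1|}$) would be equal to the induced
ordering of nodes due to normalized α-centrality
($\alpha < \frac{1}{|\lambda_1|}$). □

Theorem 2. The value of normalized α-centrality remains the
same $\forall \alpha \in (\frac{1}{|\lambda_1|}, 1]$
($C_{N_\alpha,\alpha > \frac{1}{|\lambda_1|}} = C_{N_\alpha}$).

Proof. As can be seen from Equation 28, when
$\alpha > \frac{1}{|\lambda_1|}$ and $k \to \infty$,
$NC_{\alpha,k\to\infty}$ reduces to
$\frac{A^k}{\sum_{i,j}^{n} (A^k)_{ij}}$ and is independent of α.
Since normalized α-centrality
$C_{N_\alpha,\alpha} = vNC_{\alpha,k\to\infty}$, the value of
normalized α-centrality remains the same
$\forall \alpha \in (\frac{1}{|\lambda_1|}, 1]$. □

The remaining theorems hold under the condition that
$|\lambda_1|$ is strictly greater than any other eigenvalue,
which is true in most real life cases studied.

Theorem 3. $\lim_{\alpha \to \frac{1}{|\lambda_1|}} C_{N_\alpha,\alpha}$
exists and

$$\lim_{\alpha \to \frac{1}{|\lambda_1|}} C_{N_\alpha,\alpha} = C_{N_\alpha} = C_{N_\alpha,\alpha > \frac{1}{|\lambda_1|}} = \frac{vY_1}{\sum_{i,j}^{n} (Y_1)_{ij}}$$

Proof. Under the assumption that $|\lambda_1|$ is strictly
greater than any other eigenvalue, Equation 27, with $C_\alpha$
expanded as in Equation 25, reduces as
$\alpha \to \frac{1}{|\lambda_1|}^-$ to

$$C_{\alpha \to \frac{1}{|\lambda_1|}^-,\, k\to\infty} = \frac{1}{1 - \alpha\lambda_1} Y_1 \tag{29}$$

This is because all other eigenvectors shrink in importance as
$\alpha \to \frac{1}{|\lambda_1|}^-$. Therefore, as
$\alpha \to \frac{1}{|\lambda_1|}^-$, we have

$$NC_{\alpha \to \frac{1}{|\lambda_1|}^-,\, k\to\infty} = \frac{1}{\sum_{i,j}^{n} (Y_1)_{ij}} Y_1 \tag{30}$$

Under the assumption that $|\lambda_1|$ is strictly greater
than any other eigenvalue, for
$\alpha > \frac{1}{|\lambda_1|}$ the term
$\alpha^k \lambda_1^k Y_1$ dominates in Equation 25:

$$C_{\alpha,k\to\infty} \approx \alpha^k \lambda_1^k Y_1, \qquad NC_{\alpha,k\to\infty} \approx \frac{1}{\sum_{i,j}^{n} (Y_1)_{ij}} Y_1 \tag{31}$$

Hence from Equation 31 we have

$$NC_{\alpha \to \frac{1}{|\lambda_1|}^+,\, k\to\infty} = \frac{1}{\sum_{i,j}^{n} (Y_1)_{ij}} Y_1$$

Since

$$\lim_{\alpha \to \frac{1}{|\lambda_1|}^-} NC_{\alpha,k\to\infty} = \lim_{\alpha \to \frac{1}{|\lambda_1|}^+} NC_{\alpha,k\to\infty} \tag{32}$$

$\lim_{\alpha \to \frac{1}{|\lambda_1|}} NC_{\alpha,k\to\infty}$
exists and

$$\lim_{\alpha \to \frac{1}{|\lambda_1|}} NC_{\alpha,k\to\infty} = \frac{1}{\sum_{i,j}^{n} (Y_1)_{ij}} Y_1$$

Since $C_{N_\alpha,\alpha} = vNC_{\alpha,k\to\infty}$, therefore

$$\lim_{\alpha \to \frac{1}{|\lambda_1|}} C_{N_\alpha,\alpha} = C_{N_\alpha} = C_{N_\alpha,\alpha > \frac{1}{|\lambda_1|}} = \frac{vY_1}{\sum_{i,j}^{n} (Y_1)_{ij}} \tag{33}$$

□

Theorem 4. For symmetric matrices, the induced ordering
of nodes due to eigenvector centrality $C_e$ is equivalent to
the induced ordering of nodes given by normalized centrality
$C_{N_\alpha} = \lim_{\alpha \to \frac{1}{|\lambda_1|}} C_{N_\alpha,\alpha} = C_{N_\alpha,\alpha > \frac{1}{|\lambda_1|}}$.

Proof. For symmetric matrices

$$A = X \Lambda X^{-1} = X \Lambda X^{T} \tag{34}$$

Therefore Equation 21 reduces to

$$Y_i = X Z_i X^{T} = X_i X_i^{T} \tag{35}$$

where $X_i$ is the column of $X$ representing the eigenvector
corresponding to $\lambda_i$. Hence, in the case of symmetric
matrices:

$$C_{N_\alpha} = \lim_{\alpha \to \frac{1}{|\lambda_1|}} C_{N_\alpha,\alpha} = C_{N_\alpha,\alpha > \frac{1}{|\lambda_1|}} = \frac{vY_1}{\sum_{i,j}^{n} (Y_1)_{ij}} = c_1 v X_1 X_1^{T} = c_2 X_1^{T} \tag{36}$$

where $c_1 = \frac{1}{\sum_{i,j}^{n} (Y_1)_{ij}}$ and
$c_2 = c_1 v X_1$.

Since $X_1^{T}$ corresponds to the eigenvector centrality
vector $C_e$, for symmetric matrices the induced ordering of
nodes given by eigenvector centrality $C_e$ is equivalent to
the induced ordering of nodes given by normalized centrality
$C_{N_\alpha} = \lim_{\alpha \to \frac{1}{|\lambda_1|}} C_{N_\alpha,\alpha} = C_{N_\alpha,\alpha > \frac{1}{|\lambda_1|}}$.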



